Quantitative prediction method

ABSTRACT

The present invention concerns methods and systems for analysis of drug resistance in HIV-1. More specifically, the invention provides methods for predicting drug resistance by correlating genotypic information with phenotypic profiles. The methods allow the identification of primary and secondary resistance-associated mutations for new and existing drugs and for calculating the contribution of mutations and combinations of mutations to resistance and hyper-susceptibility. The invention allows the design, optimization and assessment of the efficiency of a therapeutic regimen based upon the genotype of the disease affecting a patient

The present invention concerns methods and systems for analysis of drugresistance in HIV-1. More specifically, the invention provides methodsfor predicting drug resistance by correlating genotypic information withphenotypic profiles. The methods allow the identification of primary andsecondary resistance-associated mutations for new and existing drugs andfor calculating the contribution of mutations and combinations ofmutations to resistance and hyper-susceptibility. The invention allowsthe design, optimization and assessment of the efficiency of atherapeutic regimen based upon the genotype of the disease affecting apatient.

This application claims priority benefit of EP patent application nr.03101687.6, and of U.S. Provisional Application No. 60/478,780 filed onJun. 16, 2003, the contents of which are expressly incorporated byreference herein. All other publications, patents and patentapplications cited herein are incorporated in full by reference.

BACKGROUND

Techniques to determine the resistance of HIV-1 to a therapeutic agentare becoming increasingly important Many patients experience treatmentfailure or reduced efficacy over time. This is generally due to thevirus mutating and/or developing a resistance to the treatment. As usedherein, “HIV” is the human immunodeficiency virus, which is aretrovirus.

The various different anti-HIV-1 agents that have been developed overthe years were initially administered to patients alone, as monotherapy.Though a temporary antiviral effect was observed, all the compounds losttheir effectiveness over time. Research has now demonstrated that one ofthe main reasons behind treatment failure for all the antiviral drugs isthe development of resistance of the virus to the drug (see, forexample, Larder et al., 1989, Science, 246, 1155-8). This is largely dueto the ability of HIV continuously to generate a number of geneticvariants in a replicating viral population. These genetic changesgenerally alter the configuration of the HIV reverse transcriptase (RT)and protease (PR) molecules in such a way that they are no longersusceptible to inhibition by compounds developed to target them. Ifantiretroviral therapy is ongoing and if viral replication is notcompletely suppressed, the selection of genetic variants is inevitableand the viral population becomes resistant to the drug.

Since then, dual combination therapy, using drugs that target both HIVreverse transcriptase (RT) and protease (PR) molecules, has providedincreased control of viral replication, and thus provided extendedclinical benefit to patients. In recent years, however, it has becomeclear that even patients being treated with triple therapy including aprotease inhibitor often eventually experience treatment failure.

Since patients in the developed world are generally prescribed cocktailsof therapeutic drugs, not all HIV-1 infections originate with a wildtype but with drug sensitive strains, from which drug resistanceinevitably emerges. As such, with the increase in prevalence of drugresistant strains, there comes an increase in infections that actuallybegin with drug resistant strains. Infections with pre-existing drugresistance immediately reduce the drug options for drug treatment andemphasize the importance of drug resistance information to optimizeinitial therapy for these patients.

Moreover, as the number of available antiretroviral agents hasincreased, so has the number of possible drug combinations andcombination therapies. It is therefore very difficult, if notimpossible, for the physician to establish the optimal combination foran individual. Although there are many drugs available for use incombination therapy, the choices can quickly be exhausted and thepatient can rapidly experience clinical progression or deterioration ifthe wrong treatment decisions are made. The key to tailored,individualized therapy lies in the effective profiling of the individualpatient's virus population in terms of sensitivity or resistance to theavailable drugs. This requires the advent of truly individualizedtherapy.

There are certain solutions to this problem currently in use.

Phenotyping directly measures the actual sensitivity of a patient'spathogen or malignant cell to particular therapeutic agents. However,this can be slow, labor-intensive and thus expensive.

A second approach to measuring resistance involves genotyping tests thatdetect specific genetic changes (mutations) in the viral genome whichlead to amino acid changes in at least one of the viral proteins, knownor suspected to be associated with resistance. Although genotyping testscan be performed more rapidly, a problem with genotyping is that thereare now over 100 individual mutations with evidence of an effect onsusceptibility to HIV-1 drugs and new ones are constantly beingdiscovered, in parallel with the development of new drugs and treatmentstrategies. The relationship between these point mutations, deletionsand insertions and the actual susceptibility of the virus to drugtherapy is extremely complex and interactive. An example of thiscomplexity is the M184V mutation that confers resistance to 3TC butreverses AZT resistance. The 333D/E mutation, however, reverses thiseffect and can lead to dual AZT/3TC resistance.

Sophisticated interpretation is therefore required to predict what thenet effect of these mutations might be on the susceptibility of thevirus population to the various therapeutic agents. Custom algorithmssuch as rules-based computer algorithms have provided some assistance,for example, see International patent application WO01/79540. Anoverview of this type of technique is presented in FIG. 1.

Beerenwinkel et al., PNAS (June 2002), 99(12), pp 8271-8276; Schmidt etal., AIDS (August 2000), 14(12), pp 1731-1738; and Sevin et al., Journalof Infectious Diseases, (July 2000), 182(1), pp 59-67; disclose methodsfor quantitating the individual contribution of a mutation orcombination of mutations to the drug resistance phenotype exhibited byHIV based on different algorithms such as, respectively, decision trees,a rule-based approach and statistical analyses such as cluster analysis,recursive partitioning, linear discriminant analysis. In Schmidt et al.,AIDS Reviews, 4(3), pp 148-156, further methods are reviewed.

Meisel et al., Therapeutic Drug Monitoring (February 2001), 23(1), pp9-14; and Meisel et al., Pharmacogenetics (1997), 7(3), pp 241-246;disclose a method for predicting the metabolic activity phenotype fromthe mutation pattern of the NAT-2 gene by multiple linear regressionanalysis. The linear regression model describes a quantity R_(s), themetabolic ratio built up by an error term (first term) and a sum ofproducts built up from a mutation factor multiplied by amutation-dependent resistance coefficient. However, given the nature ofthe NAT-2 genotypic patterns, the above methods do not consider therelationship between point mutations within a genotypic pattern. Inparticular, the quantitative prediction methods proposed are merely anaddition of independent variables where effects such as antagonism orsynergy between point mutations, insertions or deletions are not takeninto account.

There remains a continuing need for the quantitative prediction of HIVdrug susceptibility from viral genotype. In particular, there is a needfor quantitative prediction methodologies like linear regressionmodelling which can grasp the complexity of the HIV-1genotypic-to-phenotypic dynamics, i.e. combinatorial effects such asantagonism and synergism. Furthermore, because the majority of HIVpatients have now been exposed to drug cocktails, it is thought that thedisease-causing retroviruses tend to spontaneously generate mutationsthat have often co-evolved. This makes the analysis of which mutationsare responsible for resistance to which drugs almost impossible usingcurrently available techniques. It also means that mutations thatcontribute to resistance are being overlooked using the currentlyavailable analysis techniques.

It is therefore an aim of the present invention to provide methods forimproving the interpretation of genotypic results.

It is a further aim of the invention to provide methods for determining(or predicting) a phenotype based on a genotype.

It is also a further aim of the invention to provide methods forpredicting the resistance of an HIV variant of a particular genotype toa therapy or a therapeutic agent.

It is also an aim of the invention to predict resistance of a patient totherapy.

It is also an aim of the invention to provide methods to assess theeffectiveness or efficiency of a therapy or to optimize a patient'stherapy.

It is also an aim of the invention to identify novel HIV-1 mutationsthat are associated with resistance to particular drug therapies orcombination therapies.

SUMMARY OF THE INVENTION

A solution to these problems involves new methods for measuring drugresistance by correlating genotypic information with phenotypic drugresistance profiles measured experimentally.

According to a first aspect of the invention, there is provided a methodfor quantitating the individual contribution of a mutation orcombination of mutations to the drug resistance phenotype exhibited byHIV, said method comprising the steps of:

a) performing a linear regression analysis using data from a dataset ofmatching genotypes and phenotypes, whereby the log fold resistance, pFR,is modelled as the sum of all the individual resistance contributionsfor each of the mutations or combinations of mutations that occur in HIVaccording to the following equation;pFR=β _(A) M _(A)+β_(B) M _(B)+ . . . +β_(Z) M _(Z)+εwherein each individual resistance contribution is calculated bymultiplying a mutation factor, M_(A), M_(B), . . . , M_(Z), for eachmutation or combination of mutations by a resistance coefficient, β_(A),β_(B), . . . , β_(Z);wherein the mutation factor assigned to each mutation or combination ofmutations reflects the degree to which that mutation or combination ofmutations is present in the HIV strain and, if present, to which degreethe mutation is present in a mixture;wherein each resistance coefficient reflects the contribution of themutation or combination of mutations to the fold resistance exhibited bythe strain;and wherein the error term ε, represents the difference between themodelled prediction and the experimentally determined measurement.

This method involves a data driven technique for quantitative drugsusceptibility prediction. This method uses a multiple linear regressionmodel to estimate coefficient values that accurately reflect thecontribution made by a particular HIV mutation or combination ofmutations to resistance to a particular drug. Repeating the method foreach candidate therapeutic drug allows the compilation of a globalpicture of drug resistance exhibited by a particular HIV strain.

This method has allowed the identification of mutations hithertounrecognized as having an effect on drug resistance in HIV. The methodalso allows the identification of primary (single mutations) andsecondary (the co-occurrence of two mutations) or higher order termsresistance-associated mutations for new and existing drugs, whichembrace the antagonistic and synergistic phenomena. Accordingly, afurther aspect of the invention provides a method of identifying amutation that affects the degree of drug resistance exhibited by an HIVstrain using a method according to the first aspect of the invention.

The method of the first aspect of the invention is also advantageousover current methods since it allows the quantitative, purelydata-driven, objective-assessment of the contribution of mutations andcombinations of mutations to drug resistance. The method also allows thede-convolution of the individual contribution made by particularmutations to the drug resistance phenotype. Unlike existing methods, themethod is able to correct for correlating mutations that on the face ofit appear to affect drug resistance, but which in fact only correlate intheir occurrence with resistance causing mutations and are themselvesphenotypically silent.

The method has allowed the design of an automated computationaltechnique for the prediction of the drug resistance profile possessed bya particular HIV strain infecting a patient. The methods thus allow thedetermination of a patient phenotype without having to perform anyphenotypic testing whatsoever. This has clear ramifications for thebespoke design, optimisation and assessment of strategies for individualpatient therapy based upon the genotype of the infecting agent.

The invention also provides diagnostic kits for performing each of themethods of the invention described herein.

DESCRIPTION

In any population of HIV variants, there is a wide distribution of drugresistance phenotypes for any particular drug, ranging fromhyper-susceptibility to strong resistance (see FIG. 2). The expression“drug resistance phenotype” means the resistance of an HIV virus to atested therapy, therapeutic agent or drug. The term “resistance” as usedherein, pertains to the capacity of resistance, sensitivity,susceptibility, hyper-susceptibility or effectiveness of a therapyagainst a disease. The term “therapy” includes but is not limited to adrug, pharmaceutical, or any other compound or combination of compoundsthat can be used in therapy or therapeutic treatment of HIV. Thisdistribution of drug resistance reflects the large number of differentgenotypes that are present in the population. Some variants may onlyhave one mutation that is correlated with drug resistance, whilst otherswill have several or numerous such mutations, each of which may impartits own contribution to the drug resistance phenotype.

Adding an additional level of complication are the phenomena ofantagonism, synergy and enhancement, where certain mutations may add toor detract from the effect of other mutations in a manner notpredictable from studying the effects of the individual mutations alone.Highly correlated mutations are also problematic. These are mutationsthat almost always co-occur in a strain, but only one of the mutationsactually has an effect on drug resistance. For example, when one ofthese 2 mutations has an effect on resistance and the other mutationdoes not (this mutation might for example be highly correlated with theresistance mutation because it affects the replication rate of thevirus), the effect can erroneously be assigned to either one of themutations.

Examples of mutations known or suspected to influence the sensitivity ofHIV to drug therapy may be found on the internet athttp://hiv-web.lanl.gov; http://hivdb.stanford.edu/hiv/; orhttp://www.viral-resistance.com.

In HIV, two sections of the genome are generally studied: Protease (PR)and Reverse Transcriptase (RT). The methods of the present invention canequally be applied to other sections of the HIV genome such as Integrase(IN). A mutation is presented as a number referring to the position inthe protein, followed by the amino acid(s) on that position, if itdiffers from the amino acid in the HXB2 HIV reference. In the termsincluded above, the mutations are represented as “A”, “B”, . . . , “Z”.

Mixtures reflect the diversity of the HIV population in a sample. Itmeans that on that position two subsets of the population have adifferent amino acid. Mixtures are denoted by separating amino acidswith the ‘/’ character: 65K/R (mixture of ‘K’ and ‘R’ at position 65).

When more than two amino acids are found on a certain position insubsets of the population, the dummy amino acid ‘X’ is used.

Insertions are denoted by adding the insert position behind a dot: 69.2S(an insert of ‘S’ at insert position 2). Deletions are denoted by aminus sign: 69−.

Examples of mutations present in the RT domain of HIV conferringresistance to a reverse transcriptase inhibitor include 69C, 69V, 69T,75A, 1011, 103T, 103N, 184T, 188H, 190E, 219N, 219Q, 221Y, 2211, and233V. Additional examples of mutations present in the protease (PR)domain of HIV conferring resistance to a reverse transcriptase inhibitorinclude 24M, 48A, and 53L. A mutation may affect resistance alone or incombination with other mutations.

For the purposes of the invention, the mutations identified should beassociated with resistance or susceptibility to drug therapy, forexample an antiretroviral drug. The degree to which a particularmutation pattern may affect resistance may be determined by one of skillin the art, for example, using the phenotypic resistance monitoringassay such as, the ANTIVIROGRAM® (Virco, Belgium) (see WO97/27480). Inthis methodology, resistance is determined with respect to a laboratoryreference strain HIVLAI/IIIB. The difference in IC₅₀ (the concentrationof drug required to reduce the virus' growth in cell culture by 50%)between the patient sample and the reference viral strain is determinedas a quotient. This fold change in IC₅₀ is reported and indicative ofthe resistance profile of a certain drug. Based on the changes in IC₅₀,cut-off values have been established to distinguish a sample from beingsensitive or resistant to a certain drug.

Various projects are underway to compile data relating to thecorrespondence of certain mutations with drug resistance phenotype, andthese generally lead to the generation of relational databases of tablesthat illustrate the matching genotype / resistance phenotype for variousantiretroviral drugs. Such databases bring together the knowledge ofboth a genotypic and phenotypic database. The phenotypic databasecontains phenotypic resistance values for HIV to at least one therapy,preferably multiple drug therapies. For example, the phenotypicresistance values of tested HIV viruses, with a fold resistancedetermination compared to the reference HIV virus (wild type).

The dataset used herein is a dataset developed by the Applicant, whichconsists of a set of matching genotype / phenotype measurements withpossible multiple phenotype measurements per genotype. However, anysimilar dataset may be used, provided that there are sufficient entriesfor each genotype / phenotype measurement for the data to besignificant. In the Virco dataset, the mutations are defined relative toHXB2 at amino acid level.

The phenotypes are presented as pFR values, where pFR is equal to −log(FR), where FR denotes the Fold Resistance. Negative pFR values thusdenote resistance and positive values denote hyper-susceptibility. Forexample, a pFR value of −1.0 is equal to 10-fold resistance. An exampleof the pFR distribution for Saquinavir (SQV) is shown in FIG. 3. FIG. 4shows the pFR distribution for the “48V” mutation on SQV. It is clearfrom this that the 48V subset does not behave the same as the wholedataset.

The problems of unwanted correlations between mutations where not allcorrelated mutations contribute to the drug resistance phenotype areillustrated in FIG. 6. Here, the left hand panel shows the pFRdistribution for the 71I mutation. When the effects of mutations 48V and84V are removed (right hand panel), the pFR in the distribution ofvariants is markedly increased (less drug resistance).

According to the invention, the predicted fold resistance of an HIVstrain of a particular genotype may be calculated by summing theindividual resistance contributions for each of the mutations orcombinations of mutations in the mutation pattern of that genotype. Themethod uses linear regression models, so that the phenotype prediction,pFR is calculated in the following equation (1):pFR=β _(A) M _(A)+β_(B) M _(B)+ . . . +β_(Z) M _(Z)+εThe independent variables M_(A), M_(B), . . . , M_(Z), are referred toherein as mutation factors, each of which reflects the degree to whichthe mutation or combination of mutations is present in the HIV strainand, if present, whether or not the mutation is present in a mixture.

The resistance coefficients, β_(A), β_(B), . . . , β_(Z) represent thecontribution to the total pFR prediction for each single mutation.

Each mutation factor M_(i) thus represents the presence or absence ofthe corresponding mutation and each coefficient, β_(i) represents thecontribution to the pFR change for that specific mutation.

The mutation factor may take into account 1^(st) order terms (singlemutations) as well as 2^(nd) order terms (the co-occurrence of twomutations) and in general no order terms. For example, 2^(nd) orderterms take the form:β_(AB)M_(AB)The independent variable M_(AB) represents the co-occurrence ofmutations A and B and the coefficient, β_(AB) represents the synergy orantagonism between mutations A and B. When the mutation factor embracesn^(th) order terms, thus the co-occurrence of two ore more mutations,the terms take the form:β_(n)M_(n)wherein M_(n) represents the co-occurrence of one mutation with otherone or more mutations—such as duplets, triplets, quadruples, etc. andthe coefficient β_(n) represents the synergy or antagonism between theone mutation with the other one or more mutations; thus n will apply forinstance to the combinations of mutations, AB, ABC, ABCD, BC, ACD, etc.

As such, the linear regression model may take the following equation:pFR=β_(A) M _(A)+β_(B) M _(B)+β_(n) M _(n)+ . . . +β_(Z) M _(Z)+ε

Higher order terms affect for interactions between mutations:

-   -   antagonism or reversal: positive pFR shifts for mutation        couples.    -   synergy or enhancement extra negative pFR shift for mutation        couple.

For example, consider the following (artificial) model: MutationCoefficient 84V −0.46 50V −0.92 54M −0.64 88S 0.63 90M −0.16 46L −0.1946L & 84V −0.09

Consider a virus with following mutations: 3I, 46L, 84V and 90M.Applying equation (1), this virus will have a pFR prediction:pFR=β_(46L).1+β_(84V).1+β_(90M).1+β_(46L,84V).1=−0.9or almost 8-fold resistance, i.e. pFR=−logFR.

Note that in the model F and A are synergistic since their co-occurrencedecreases the pFR by an extra−0.09.

The error term is ε, which is the difference between the prediction andthe measurement. This error term contains both the measurement error onthe phenotype measurement and a model error (if the underlying model hashigher order terms that are not taken into account in the regressionmodel).

Mutation factors for single mutations (M_(A), M_(B), . . . , M_(Z)) arecalculated as follows:

if the mutation is present in the HIV strain, a positive mutation factoris assigned;

if the single mutation is not present, the mutation factor assigned iszero;

if the single mutation is present in a mixture, an averaged positivemutation factor is assigned.

Conveniently, mutation factors range between 0 and 1 where 0 means notpresent and 1 means present. Values between 0 and 1 means that themutation is present in a mixture. Accordingly, a positive mutationfactor is assigned the value 1.

Mixtures are modelled as causing the average shift of its constituentmutations. Since methods for the quantitation of the precise proportionsof mixtures to wild type are expensive and time-consuming, mixture withwild type may conveniently be treated as causing half the pFR shift ofthe resistance mutation (mutation factor=0.5). However, as the skilledreader will appreciate, a more precise mutation factor may be assignedif the true proportion in the mixture is known.

Mutation factors for double mutations (M_(AB) etc.) are calculated asfollows;

if both the mutations are present in the HIV strain, a positive mutationfactor is assigned (conveniently, the value 1);

if neither of the mutations are present, the mutation factor assigned iszero;

if both mutations are present and one mutation is present in a mixture,an averaged positive mutation factor is assigned (conveniently, 0.5);

if both mutations are present in a mixture, a reduced averaged positivemutation factor is assigned (in this example, 0.25). The factor 0.25 isthe product of the M-factors of both the single constituent mutations.This is the result of the assumption that these mixtures are independentof each other. Of course, this is an approximation, since in a realblood sample, the mixtures are not independent of each other. Forexample, if only 2 viruses were present, virus A (no mutations) for 70%and virus B (mutations 46I and 84V) for 30%, then a mixture would bedetected on both positions 46 and 84. If these concentrations wereknown, it would be possible to fine tune the mutation factor of 0.25. Ifthis information is not available, the best statistical guess is0.5*0.5: this being the average value that would be measured for themutation couple being present for a population of samples that havethese mixtures on 46 and 84 in all possible concentrations.

Similarly, the mutation factors for triple mutations (M_(ABC) etc.)shall be calculated as follows:

if the three mutations are present in the HIV strain, a positivemutation factor is assigned (conveniently, the value 1);

if neither of the three mutations are present, the mutation factorassigned is zero;

if the three mutations are present and one mutation is present in amixture, an averaged positive mutation factor is assigned (conveniently,0.5);

if the three mutations are present and two mutations are present in amixture, a reduced averaged positive mutation factor is assigned (inthis example, 0.25). The factor 0.25 is the product of the M-factors ofthe two single constituent mutations present in the mixture;

if the three mutations are present in a mixture, a reduced averagedpositive mutation factor is assigned (in this example, 0.125). Thefactor 0.125 is the product of the M-factors of the three singleconstituent mutations.

The calculation of the mutation factors for higher order terms shalltake the same principle.

Calculation of the resistance coefficient (β_(A), β_(B), . . . , β_(Z),β_(AB)) is performed by evaluating the dataset for the drug phenotypereported for each mutation or combination of mutations.

The problem of unwanted correlations has been discussed above. Unwantedcorrelations are removed according to the methods of the invention.

One way to do this is to use an algorithm that has been developed by theinventors to track the change in pFR as the effects of individualmutations or combinations of mutations are removed from the dataset. Theeffect of each mutation or combination of mutations is thus separatedout. The methodology follows mutation trajectories towards the globalaverage as the effects of individual mutations or combinations ofmutations are removed. The steps are as follows:

-   a) calculate average pFR for all mutations with a sufficient count    in the database to be significant;-   b) determine the extremes (maximum, minimum), and select the    mutation with the pFR furthest away from the global average;-   c) remove all virus strains that have the selected mutation and    reiterate from step a);-   d) stop when the selected mutation in step b) has an average pFR    that approximates to the global average.

In this manner, mutations that do not cause resistance, but which areoften present with mutations that do cause resistance will have a higheraverage pFR (less resistance). Removing the virus strains with a certainresistance causing mutation results in an increase of the average pFRfor correlating mutations.

A suitable threshold at which a count in the database becomessufficiently significant will be apparent to the skilled reader and willbe dependent on the database size. For example, thresholds of 5, 10, 15,20, 25, 30 or more may be suitable. In the examples discussed herein, athreshold of 20 times was used.

By an “average pFR that approximates to the global average” is meantthat the average pFR is within a fraction of the standard deviation ofthe remaining population. A convenient fraction ranges between about 0.3and 0.5.

A comparison of the change in the global average pFR with the change inthe average pFR for selected mutations with increasing iterations of thealgorithm is shown in FIG. 7. FIG. 8 shows an example, where the averagepFR for 71I (unwanted correlation) jumps up as a result of removing fromthe dataset virus strains that have “71I & 84V” and “48V” mutations.

An alternative, analogous methodology for removing unwanted correlationsis as follows; this is an extension of the mutation trajectoriesalgorithm discussed above. The steps of this method are as follows:

-   a) calculate correlation coefficient between all mutations (with a    sufficient count in the database) and the pFR;-   b) determine the extremes (maximum, minimum), and select the    mutation with the highest (absolute value of) correlation    coefficient;-   c) calculate a linear model for the pFR with the selected    mutation(s) (from step b), all previous iterations);-   d) take the residue (pFR minus the predicted value from the model);-   e) calculate correlation coefficient between all mutations (with a    sufficient count in the database) and the residue;-   f) determine the extremes (maximum, minimum), and select the    mutation with the highest (absolute value of) correlation    coefficient;-   g) calculate a linear model for the pFR with the selected    mutation(s) (from step f), all previous iterations); and-   h) reiterate from step d);-   i) stop when the selected mutation in step g) has a correlation    coefficient that approximates to zero.

As with the mutation trajectories algorithm described above, the effectof mutations that do not themselves cause resistance, but which areoften present with mutations that do cause resistance, is excluded andthus does not distort the real values.

In a more preferred methodology for removing unwanted correlations, astepwise selection regression may be applied, which method selects thevariable with the highest effect. The steps of this method are asfollows:

-   a) perform a first order regression from the list of mutations that    occur in the dataset;-   b) calculate the p-value for all mutations;-   c) select the mutation with the lowest p-value and add it to the    model;-   d) re-calculate the regression model;-   e) reiterate from step b);-   f) stop when the re-calculation of the p-values of step b) gives no    significant values anymore.

Usually, this methodology is run in statistical software packages, whichiteratively model the residue from the previous regression as thedependent variable.

The p-value for a given mutation is the probability of rejecting thetrue null hypothesis, where the null hypothesis is defined as follows:the coefficient of that parameter equals zero. In other words, thep-value is the probability that the real coefficient for a certainparameter is zero, while the model predicts a coefficient different fromzero.

The expression “significant value” for the p-values refers to themutations with a p-value that is lower than the threshold selected bythe user. This threshold may be determined as follows: the user shallcreate linear models for a whole range of combinations of p-values (ap-value for the first-order iteration and a p-value for the second-orderiteration). For each combination, the mean squared error (MSE) of thecorresponding model is calculated on unseen data. The combination ofp-values that results in the model with the lowest MSE, i.e. thecombination for which the model gives the best predictions, is chosen.

In further preferred embodiments of the invention, problems of smalldatasets for particular mutations or combinations of mutations are dealtwith by applying the method recursively to the set of virus strains thatexhibit those particular mutations or combinations of mutations.

In still further preferred embodiments of the invention, the followingadditional correlations are taken into account:

-   -   multiple entries of the same virus strain (or virus strains        grown from the same stock solution) that cause unwanted        correlations;    -   censored values in genotype / phenotype database (for example,        EC₅₀ value=‘<1 μM’). These are phenotypes beyond the assay        range, thus when the phenotypic value is smaller than the        measurable range, a ‘<’-censor is applied to that value.        Analogously, a ‘>’-censor is applied to the value, if it is        higher than the measurable range.

Preferably, censored values are dealt with by attempting to construct amodel that is consistent from extrapolations. Censored values are thusmodeled by replacing the censored value by a maximum likelihoodestimation, assuming knowledge of the standard deviation of themeasurement error.

A preferred technique for the generation of a maximum likelihoodestimation is as follows:

-   a) calculate a linear regression model without censored values;-   b) use the phenotypic measured value V₀ as if the censor was “=”,    e.g. when a result is expressed as −log FR<4, we will treat V₀ as    −log FR=4;-   c) look at the prediction P from the model and apply either:    -   Case ‘<’-censor:        -   P<V₀−0.798 σ (center of gravity of half Gaussian            distribution)            -   Remove value from training data for next iteration                V₀−0.798 σ≦P<V₀            -   Use V′=V₀−0.798 σ for next iteration        -   V₀≦P            -   Use V′ centre of gravity of tail (<V) of a normal                distribution N (P, σ) as value for next iteration.    -   Case ‘>’-censor:        -   P>V₀+0.798 σ (center of gravity of half Gaussian            distribution)            -   Remove value from training data for next iteration        -   V₀+0.798 σ≧P>V₀            -   Use V′=V₀−0.798 σ for next iteration        -   V₀≧P            -   Use V′ centre of gravity of tail (>V) of a normal                distribution N (P, σ) as value for next iteration.-   d) calculate a linear regression model and for the censored values    in the linear regression model, either remove the data-point from    the training set, or use V′ instead of the censored phenotypes    measurement, as described in step c);-   e) re-iterate from step b) until the prediction converges.

Accordingly, for each iteration, when the prediction and measurementcontradict, censored values are taken into account. When the predictionand measurement are strongly consistent, censored values aredisregarded, on the basis that no further information is provided andtheir inclusion has no additional value.

In one preferred embodiment of this aspect of the invention, the numberof calculations necessary in the linear regression analysis may bereduced. The computational power and memory requirement that iscurrently generally available is insufficient to allow a full secondorder model to be evaluated for a large dataset, based on all possiblesingle mutations and second order terms, since the number of termsincreases quadratically with the number of mutations considered. Thisnumber increases with a larger dataset since more rare mutations are ina large database.

In order to reduce the amount of terms, a first order regression may beperformed from the list of mutations that occur in the dataset above athreshold number of times. A suitable threshold at which a count in thedatabase becomes sufficiently significant will be apparent to theskilled reader and will be dependent on the database size. For example,thresholds of 5, 10, 15,20, 25, 30 or more maybe suitable. In theexamples discussed herein, a threshold of 20 times was used. Thesignificant terms from this first order regression are withheld and thelist of these terms is then used to perform a second order regression.In the second order regression only the single mutations andcombinations of mutations are used that were found significant in thefirst order model.

Again, a threshold significance will be apparent to the skilledreader—an example is if the probability that the real value of the termis 0, is smaller than 0.001.

For example, a first order regression performed on the matching genotype/ phenotype dataset for indinavir (34,445 measurements) for thosemutations that occur at least 20 times results in a first order modelthat withholds a list of 94 single mutations that are consideredsignificant.

This list is then used as a starting list for a second order regression.It should be noted that it may be advantageous to exclude certain verycommon mutations from the calculation. 3I is one example. The reason isthat a mutation must occur at least a threshold number of times and theinverse also has to be true: the count of viruses not having themutation 3I or the couple not 3I and another mutation should also beabove the threshold value (e.g. 20). Taking this into account results inexcluding 3I from the regression in practice.

In the second order regression, all the single mutations and all couplesof mutations from the list are used as potential terms. The significantterms are then withheld by the regression algorithm.

According to a further aspect of the invention, there is provided amethod of calculating the quantitative contribution of a mutationpattern to the drug resistance phenotype exhibited by an HIV strain,said method comprising the steps of:

a) obtaining a genetic sequence of said HIV strain;

b) identifying the pattern of mutations in said genetic sequence,wherein said mutations are associated with resistance or susceptibilityto drug therapy; and

c) calculating the fold resistance of the HIV strain as compared to thewild type HIV strain by performing a linear regression analysis, wherebythe log fold resistance, pFR, is modelled as the sum of all theindividual resistance contributions for each of the mutations orcombinations of mutations that occur in said HIV strain according to thefollowing equation;pFR=β_(A) M _(A)+β_(B) M _(B)+ . . . +β_(Z) M _(Z)+εwherein each individual resistance contribution is calculated bymultiplying a mutation factor, M_(A), M_(B), . . . , M_(Z), for eachmutation or combination of mutations by a resistance coefficients β_(A),β_(B), . . . , β_(Z);wherein the mutation factor assigned reflects the degree to which themutation or combination of mutations is present in the HIV strain and,if present, to which degree the mutation is present in a mixture;wherein each resistance coefficient reflects the contribution of themutation or combination of mutations to the fold resistance exhibited bythe strain;and wherein the error term ε, represents the difference between themodelled prediction and the experimentally determined measurement.

As the skilled reader will appreciate, the fold resistance of the HIVstrain may be calculated using any one of the embodiments of theinvention referred to above.

In the first step of this method, the genetic sequence of an HIV strainshould be obtained. Normally, this will be the genetic sequence of anHIV strain with which a patient is infected, although the sequence maybe a theoretical sequence, for example for purposes of in silicomodelling.

The method may thus be used as a diagnostic method for predicting thefold resistance exhibited by a particular HIV strain with which apatient is infected. According to other preferred embodiments, themethod may be used for assessing the efficiency of a patient's therapyor for evaluating or optimising a therapy. The method may be performedfor each drug or combination of drugs currently being administered tothe patient so as to obtain a series of drug resistance phenotypes andthus to assess the effect of a plurality of drugs or drug combinationson the predicted fold resistance exhibited by the HIV strain with whichthe patient is infected.

A “patient” may be any organism, particularly a human or other mammal,suffering from HIV or AIDS or in need or desire of treatment for suchdisease. A patient includes any mammal and particularly humans of anyage or state of development.

To obtain an HIV strain from a patient, a biological sample will need tobe obtained from the patient. A “biological sample” may be any materialobtained in a direct or indirect way from a patient containing HIVvirus. A biological sample may be obtained from, for example, saliva,semen, breast milk, blood, plasma, faeces, urine, tissue samples, mucoussamples, cells in cell culture, cells which may be further cultured,etc. Biological samples also include biopsy samples.

The genetic sequence of an HIV strain may be evaluated by a number ofsuitable means, as will be clear to those of skill in the art. Mostsuitable will be techniques that allow for specific nucleic acidamplification, such as the polymerase chain reaction (PCR), althoughother techniques such as restriction fragment length polymorphism (RFLP)analysis will be equally applicable.

Nucleic acid sequencing then allows the analysis of the mutation patternin a particular nucleic acid sequence, either by classical nucleicsequencing protocols e. g. extension chain termination protocols (Sangertechnique; see Sanger F., Nicher., Coulson A. Proc. Nat. Acad. Sci.1977, 74, 5463-5467) or chain cleavage protocols. Such methods mayemploy such enzymes as the Klenow fragment of DNA polymerase I,Sequenase (US Biochemical Corp, Cleveland, Ohio), Taq polymerase (PerkinElmer), thermostable T7 polymerase (Amersham, Chicago, Ill.), orcombinations of polymerases and proof-reading exonucleases such as thosefound in the ELONGASE Amplification System marketed by Gibco/BRL(Gaithersburg, Md.). Preferably, the sequencing process may be automatedusing machines such as the Hamilton Micro Lab 2200 (Hamilton, Reno,Nev.), the Peltier Thermal Cycler (PTC200; MJ Research, Watertown,Mass.) and the ABI Catalyst and 373 and 377 DNA Sequencers (PerkinElmer). Particular sequencing methodologies have been developed furtherby companies such as Visible Genetics. Any of the novel approachesdeveloped for unraveling the sequence of a target nucleic acid, eithernow or in the future will be perfectly applicable to the analysis ofsequence in the present invention (including but not limited to massspectrometry, MALDI-TOF (matrix assisted laser desorption ionizationtime of flight spectroscopy) (see Graber J, Smith C., Cantor C. Genet.Anal. 1999, 14, 215-219) chip analysis (hybridization based techniques)(Fodor S P; Rava R P; Huang X C; Pease A C; Holmes C P; Adams C L Nature1993, 364, 555-6) It should be appreciated that nucleic acid sequencingcovers both DNA and RNA sequencing.

Once the genetic sequence of the HIV strain is known, the pattern ofmutation must be identified in the sequence. The term “mutation” as thisis used herein, encompasses both genetic and epigenetic mutations of thegenetic sequence of wild type HIV. A genetic mutation includes, but isnot limited to, (i) base substitutions: single nucleotide polymorphisms,transitions, transversions, substitutions and (ii) frame shiftmutations: insertions, repeats and deletions. Epigenetic mutationsinclude, but are not limited to, alterations of nucleic acids, e. g.,methylation of nucleic acids. One example includes (changes in)methylation of cytosine residues in the whole or only part of thegenetic sequence. In the present invention, mutations will generally beconsidered at the level of the amino acid sequence, and comprise, butare not limited to, substitutions, deletions or insertions of aminoacids.

The “control sequence” or “wild type” is the reference sequence fromwhich the existence of mutations is based. A control sequence for HIV isHXB2. This viral genome comprises 9718 bp and has an accession number inGenbank at NCBI M38432 or K03455 (gi number: 327742).

Identifying a mutation pattern in a genetic sequence under test thusrelates to the identification of mutations in the genetic sequence ascompared to a wild type sequence, which lead to a change in nucleicacids or amino acids or which lead to altered expression of the geneticsequence or altered expression of the protein encoded by the geneticsequence or altered expression of the protein under control of saidgenetic sequence.

A “mutation pattern” comprises at least one mutation influencingsensitivity of HIV to a therapy. As such, a mutation pattern may consistof only one single mutation. Alternatively, a mutation pattern mayconsist of at least two, at least three, at least four, at least five,at least six, at least seven, at least eight, at least nine or at leastten or more mutations. A mutation pattern is thus a list or combinationof mutations or a list of combinations of mutations. A mutation patternof any particular genetic sequence may be constructed, for example, bycomparing the tested genetic sequence against a wild type or controlsequence. The existence of a mutation or the existence of one of a groupof mutations can then be noted.

One way in which this may be done is by aligning the genetic sequenceunder test to a wild type sequence noting any differences in thealignment. Typical alignment methods include Smith-Waterman (Smith andWaterman, (1981) J Mol Biol, 147: 195-197), Blast (Altschul et al.(1990) J Mol Biol., 215(3): 403-10), FASTA (Pearson & Lipman, (1988)Proc Natl Acad Sci USA; 85(8): 2444-8) and, more recently, PSI-BLAST(Altschul et al. (1997) Nucleic Acids Res., 25(17): 3389-402). It may insome circumstances be preferable to generate alignments using a multiplealignment program, such as ClustalW (Thompson et al., 1994, NAR, 22(22),4673-4680). Other suitable methods will be clear to those of skill inthe art (see also “Bioinformatics: A practical guide to the analysis ofgenes and proteins” Eds. Baxevanis and Ouellette, 1998, John Wiley andSons, New York). A practical example of multiple sequence alignment isthe construction of a phylogenetic tree. A phylogenetic tree visualizesthe relationship between different sequences and can be used to predictfuture events and retrospectively to devise a common origin. This typeof analysis can be used to predict a similar drug sensitivity for asample but also can be used to unravel the origin of different patientsample (i. c. the origin of the viral strain).

In this manner, therefore, the pattern of mutations in the geneticsequence can be identified, wherein said mutations are associated withresistance or susceptibility to drug therapy exhibited by the HIV straintested. The mutation pattern may influence sensitivity to a specifictherapy, e. g., a drug, or a group of therapies. The mutation patternmay, for example, increase and/or decrease resistance of the HIV strainto a therapy. Particular mutations in the mutation pattern, may also,for example, enhance and/or decrease the influence of other mutationspresent in the genetic sequence that effect sensitivity of the HIVstrain to a therapy.

The invention further relates to a diagnostic system as herein describedfor use in any of the above described methods. An example of such adiagnostic system, for quantitating the individual contribution of amutation or combination of mutations to the drug resistance phenotypeexhibited by an HIV strain, comprises:

a) means for obtaining a genetic sequence of said HIV strain;

b) means for identifying the mutation pattern in said genetic sequenceas compared to wild type HIV;

c) means for predicting the fold resistance exhibited by the HIV strainusing any one of the methods described above.

The means for predicting the fold resistance are preferably computermeans.

A still further aspect of the invention relates to a computer apparatusor computer-based system adapted to perform any one of the methods ofthe invention described above, for example, to quantify the individualcontribution of a mutation or combination of mutations to the drugresistance phenotype exhibited by HIV, or to calculate the quantitativecontribution of a mutation pattern to the drug resistance phenotypeexhibited by an HIV strain.

In a preferred embodiment of the invention, said computer apparatus maycomprise a processor means incorporating a memory means adapted forstoring data; means for inputting data relating to the mutation patternexhibited by a particular HIV strain; and computer software means storedin said computer memory that is adapted to perform a method according toany one of the embodiments of the invention described above and output apredicted quantified drug resistance phenotype exhibited by an HIVstrain possessing said mutation pattern.

A computer system of this aspect of the invention may comprise a centralprocessing unit; an input device for inputting requests; an outputdevice; a memory; and at least one bus connecting the central processingunit, the memory, the input device and the output device. The memoryshould store a module that is configured so that upon receiving arequest to quantify the individual contribution of a mutation orcombination of mutations to the drug resistance phenotype exhibited byHIV, or to calculate the quantitative contribution of a mutation patternto the drug resistance phenotype exhibited by an HIV strain, it performsthe steps listed in any one of the methods of the invention describedabove.

In the apparatus and systems of these embodiments of the invention, datamay be input by downloading the sequence data from a local site such asa memory or disk drive, or alternatively from a remote site accessedover a network such as the internet. The sequences may be input bykeyboard, if required.

The generated results may be output in any convenient format, forexample, to a printer, a word processing program, a graphics viewingprogram or to a screen display device. Other convenient formats will beapparent to the skilled reader.

The means adapted to quantify the individual contribution of a mutationor combination of mutations to the drug resistance phenotype exhibitedby HIV, or to calculate the quantitative contribution of a mutationpattern to the drug resistance phenotype exhibited by an HIV strain willpreferably comprise computer software means. As the skilled reader willappreciate, once the novel and inventive teaching of the invention isappreciated, any number of different computer software means may bedesigned to implement this teaching.

According to a still further aspect of the invention, there is provideda computer program product for use in conjunction with a computer, saidcomputer program comprising a computer readable storage medium and acomputer program mechanism embedded therein, the computer programmechanism comprising a module that is configured so that upon receivinga request to quantify the individual contribution of a mutation orcombination of mutations to the drug resistance phenotype exhibited byHIV, or to calculate the quantitative contribution of a mutation patternto the drug resistance phenotype exhibited by an HIV strain, it performsthe steps listed in any one of the methods of the invention describedabove.

The invention further relates to systems, computer program products,business methods, server side and client side systems and methods forgenerating, providing, and transmitting the results of the abovemethods.

The invention will now be described by way of example with particularreference to a specific algorithm that implements the process of theinvention. As the skilled reader will appreciate, variations from thisspecific illustrated embodiment are of course possible without departingfrom the scope of the invention.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1: Overview of measured /predicted phenotype handling

FIG. 2: Phenotype distribution of ritonavir for matching G/P samples inVirco database

FIG. 3: pFR distribution for saquinavir

FIG. 4: Distribution of pFR for ‘48V’ mutation on saquinavir

FIG. 5: Distribution of pFR for ‘48V’ mutation on saquinavir (expanded)

FIG. 6: Removing unwanted correlations

FIG. 7: global mutation trajectories

FIG. 8: mutation trajectories for 71I

FIG. 9: Example of genotypes, mutations relative to HBX2

FIG. 10: Example of phenotype analysis for ritonavir

FIG. 11: Higher order interaction between mutations 82A and 84V

FIG. 12: Illustration of iterative procedure for censored values

FIG. 13: Linear regression model identifies mutations included in IASlist. Mutations marked with an * are also identified by a regression ona 5% subset of the data

FIG. 14: Linear regression model identifies additional mutationspreviously described in the literature

FIG. 15: Predicted versus measured log(FC)

FIG. 16: Comparison between linear regression model and decision trees.

FIG. 17: Histogram of population left after removing all virus strainsduring the iterations

FIG. 18: Trajectory of mutation 18H

FIG. 19: Residues as a function of the measured values

FIG. 20: Histogram of the residues as a function of the measured values

FIG. 21: Histogram of the residues as a function of the measured valuesafter 6 parameters were taken into account

EXAMPLES Example 1 Methodology

1.1 Introduction

This exercise involved the generation of a list of key mutations foreach of the following drugs: Indinavir, Ritonavir, Saquinavir,Nelfinavir, Amprenavir, Lopinavir, Zidovudine, Didanosine, Zalcitabine,Stavudine, Abacavir, Lamivudine, Tenofovir, Nevirapine, Delavirdine andEfavirenz.

The obtained list of key mutations is derived from a linear regressionmodel using single mutations and couples of mutations as independentvariables. The dataset used for this analysis is an export of the Vircodataset at 2003/02/01 from the vircomining tables. Table 1 shows thematching geno/pheno counts for each drug (each phenotype measurement fora genotype counts as one measurement). TABLE 1 matching geno/phenocounts Drug Count Drug Count Drug Count Amprenavir 29,508 Lamivudine34,395 Delaviridine 32,450 Indinavir 34,445 Abacavir 32,744 Efavirenz32,601 Lopinavir 7,410 Stavudine 34,420 Nevirapine 34,738 Nelfinavir34,470 Zalcitabine 34,539 Ritonavir 34,502 Didanosine 34,227 Saquinavir34,543 Tenofovir 14,591 Zidovudine 33,5751.2 Dataset

The used dataset consists of a set of matching genotype/phenotypemeasurements with possible multiple phenotype measurements per genotype.The mutations are defined relative to HXB2 at amino acid level. Thephenotypes are presented as pFR values, which is equal to −log (FR),where FR denotes the Fold Resistance.

Negative pFR values denote resistance and positive values denotehyper-susceptibility. For example, a pFR value of −1.0 is equal to10-fold resistance.

1.3 Linear Regression

For example, consider the following (artificial) model: TABLE 2 MutationCoefficient 84V −0.46 50V −0.92 54M −0.64 88S 0.63 90M −0.16 46L −0.1946L & 84V −0.09

Consider a virus with following mutations: 3I, 46L, 84V and 90M. Thisvirus will have a pFR prediction:pFR=β_(46L).1+β_(84V).1+β_(90M).1+β_(46L,84V).1=−0.9or almost 8-fold resistance. Note that in the model 46L and 84V aresynergistic since their co-occurrence decreases the pFR by an extra−0.09. Note that in the model the mutation 3I shows no resistancecoefficient assigned and therefore it is not considered in the pFRprediction equation.

FIG. 9 shows an example of four different genotypes (mutations relativeto HBX2), whilst FIG. 10 shows an example of phenotype analysis for RTVperformed according to the method of the invention.

1.4 Model Creation

Using our facilities, it was computationally infeasible to calculate afull second order model on all possible mutations and second orderterms, since the number of terms increases quadratically with the numberof mutations considered.

E.g. for APV:

-   -   Total number of occurring mutations and couples of mutations:        19,074    -   mutations and couples with each at least 20 measurements: 4,107

In order to reduce the amount of terms, a first order regression wasperformed from the list of mutations that occur at least 20 times in thedataset. The significant terms from this regression were withheld andthe list of these terms—except mutation 3I for some of the PI's, wasused to perform a second order regression. A term is called significantif the probability that the real value of the term is 0, is smaller than0.001. A mutation must occur at least 20 times and the inverse also hasto be true: the count of viruses not having the mutation 3I or thecouple not 3I and another mutation should also be at least 20. Takingthis into account results in excluding 3I from the regression inpractice. In the second order regression only the single mutations andcouples of mutations are used that were significant in the first ordermodel.

1.5 Example of Model Creation: Indinavir

A first order regression is performed on the matching geno/pheno dataset(34,445 measurements of which 28,480 unique Virco IDs) for thosemutations that occur at least 20 times. The resulting first order modelwithholds a list of 94 single mutations that are considered significant.This list (except 3I) is used as a starting list for a second orderregression. In this second order regression, all the single mutationsand all couples of mutations from the list are used as potential terms.The significant terms are withheld by the regression algorithm.

1.6 The Impact of Cross-drug Correlation on the Significance Level ofMutations

Correlation between mutations that cause resistance to different drugs,has an impact on the confidence of the coefficient for this mutation.One of the effects is that for non-nucleoside reverse transcriptaseinhibitors (NNRTIs) and nucleoside reverse transcriptase inhibitors(NRTIs), some non-relevant mutations for that drug appear as significant(though with a coefficient close to 0), because drug resistance to thedrug is correlated with drug resistance to drugs that bind at adifferent place.

Note that this is only a problem for interpretation of the model. Forprediction of the Fold Resistance, the resulting model remains a goodpFR predictor.

1.7 Effects of Second Order Terms

Antagonism TABLE 3 Parameter pFR shift Count . . . . . . . . . 82A & 84V 0.43  395 . . . . . . . . . 82A −0.27 4845 . . . . . . . . . 84V −0.263531 . . . . . . . . .Second order terms can indicate a synergy or an antagonism. In theexample above, the occurrence of either 82A or 84V cause a resistanceshift, but the co-occurrence of both mutations almost completely cancelsout the effect of both mutations and shifts to susceptible ranges. Incase both mutations are present, the net pFR shift is only −0.43, whileit is −0,26 or −0.27 if only one of the mutations are present. This isan example of strong antagonism.

Synergy TABLE 4 Parameter pFR shift Count . . . . . . . . . 24I −0.221022 . . . . . . . . . 24I & 73S −0.48  30 . . . . . . . . . 73S −0.202216 . . . . . . . . .In this example, 24I and 73S both cause a resistance shit but theirco-occurrence causes a strong extra shift towards resistance. When onlyone of the mutations is present, the pFR shift is −0.20 or −0.22, butthe presence of both mutations causes a pFR shift of −0.48. 24I and 73Sare thus strongly synergistic in this example.

Enhancement TABLE 5 Parameter pFR shift Count . . . . . . . . . 32I 0  821 . . . . . . . . . 32I & 82A −0.26 516 . . . . . . . . . 82A −0.274845  . . . . . . . . .32I by itself does not contribute to resistance, but it increases theresistance for an 82A mutation. 32I enhances the effect of the 82Amutation.

An example of the effects of higher order interactions is shown in FIG.11.

1.8 Highly Correlated Mutations

Highly correlated mutations (i.e. mutations that almost always co-occurin a strain) can affect the results of a regression analysis Forexample, when one of these 2 mutations has an effect on resistance andthe other mutation does not (this mutation might for example be highlycorrelated with the resistance mutation because it affects thereplication rate of the virus), the effect can be assigned to either oneof the mutations. Unless this is compensated for, the regression modelwill assign the effect to that mutation that reduces the predictionerror the most, which might not always be the mutation that isbiologically responsible for the effect. Due to the correlation, itwould otherwise not be possible to distinguish between these mutations.

Another effect that occurs due to correlation is when a mutation ishighly correlated with a pair of mutations in which the first mutationis present. TABLE 6 Parameter pFR shift Count . . . . . . 58N −1.47 108. . . . . . 58N & 77L  1.16 106 . . . . . . 77L 0   471 . . . . . .

In the above example, 108 samples have a 58N mutation and out of these,106 samples also have a 77L mutation. The effect of a pure 58N mutationcan only be derived from the samples that have 58N and do not have 77L,which leads to higher uncertainty on the estimated pFR shift of the 58Nmutation. The couple-term ‘58N & 77L’ will compensate for a too lowestimation of 58N by having a too high estimation for its pFR shift.

Techniques are provided in the description to deal with these effects.The algorithm developed by the inventors tracks the change in pFR as theeffects of individual mutations or combinations of mutations are removedfrom the dataset.

A comparison of the change in the global average pFR with the change inthe average pFR for selected mutations with increasing iterations of thealgorithm is shown in FIG. 7. FIG. 8 shows an example, where the averagepFR for 71I (unwanted correlation) jumps up as a result of removing fromthe dataset virus strains that have “71I & 84V” and “48V” mutations.

Example 2 Illustration of a Stepwise Regression with Amprenavir

In a dataset of 31,292 matching genotypes and phenotypes for amprenavir;a stepwise first order regression was performed, and the first 11iterations are shown here. The inventors selected as p-values:

p-value entry=0.001

p-value stay=0.5

After the first iteration all the variables with mutations found morethan 20 times were selected. In the next step the p-value wascalculated, and then the algorithm picked the variable with the lowestp-value. In this case it was P084V, and then a model was built with thisvariable (84V). The model is shown in Table 7.

Variable P084_V Entered: R-Square =0.2477 and C(p)=43445.43 TABLE 7Analysis of Variance Sum of Source DF Squares Mean Square F Value Pr > FModel 1 2042.55324 2042.55324 10301.6 <.0001 Error 31289 6203.836500.19828 Corrected Total 31290 8246.38974 Parameter Variable EstimateStandard Error Type II SS F Value Pr > F Intercept −0.00318 0.002690.27810 1.40 0.2363 P084_V −0.81754 0.00805 2042.55324 10301.6 <.0001

In the next iteration, we repeated the previous process except that theinfluence of 84V is now removed. In this second run, the mutation withthe lowest p-values was P082A. The model is shown in Table 8 below.

Variable P082_A Entered: R-Square=0.3538 and C(p)=32908.89 TABLE 8Analysis of Variance Sum of Source DF Squares Mean Square F Value Pr > FModel 2 2917.39937 1458.69968 8564.44 <.0001 Error 31288 5328.990380.17032 Corrected Total 31290 8246.38974 Parameter Variable EstimateStandard Error Type II SS F Value Pr > F Intercept 0.07033 0.00269116.28598  682.75 <.0001 P082_A −0.47905 0.00668 874.84612 5136.47<.0001 P084_V −0.82827 0.00747 2095.67091 12304.3 <.0001

In the next iteration, we repeated the previous process except that theinfluence of 84V and 82A is now removed. In this third run, the mutationwith the lowest p-values was P090M.

It may happen that alter some iterations, a mutation which was foundfirst significant is not significant anymore, and as such it is removedfrom the model.

The resistance coefficient is adjusted after every iteration because ofthe addition of a new variable to the regression. E.g. for P84V, theresistance coefficient β changes from −0.81754 to −0.82827.

The following iterations were done in the same way. The results obtainedare here below enclosed.

Variable P090_M Entered: R-Square=0.4245 and C(p)=25885.06 TABLE 9Analysis of Variance Sum of Source DF Squares Mean Square F Value Pr > FModel 3 3500.64130 1166.88043 7692.82 <.0001 Error 31287 4745.748440.15168 Corrected Total 31290 8246.38974 Parameter Variable EstimateStandard Error Type II SS F Value Pr > F Intercept 0.13680 0.00276373.40964 2461.75 <.0001 P082_A −0.40431 0.00642 601.20353 3963.52<.0001 P084_V −0.64830 0.00762 1097.72370 7236.89 <.0001 P090_M −0.335490.00541 583.24194 3845.10 <.0001

Variable P033_F Entered: R-Square=0.4729 and C(p)=21082.65 TABLE 10Analysis of Variance Sum of Source DF Squares Mean Square F Value Pr > FModel 4 3899.47454 974.86863 7016.41 <.0001 Error 31286 4346.915210.13894 Corrected Total 31290 8246.38974 Parameter Variable EstimateStandard Error Type II SS F Value Pr > F Intercept 0.14140 0.00264398.56350 2868.58 <.0001 P033_F −0.60263 0.01125 398.83323 2870.52<.0001 P082_A −0.35755 0.00621 460.87758 3317.07 <.0001 P084_V −0.607720.00733 954.28785 6868.28 <.0001 P090_M −0.30702 0.00521 483.383153479.05 <.0001

Variable P046_I Entered: R-Square=0.5110 and C(p)=17296.94 TABLE 11Analysis of Variance Sum of Source DF Squares Mean Square F Value Pr > FModel 5 4213.90735 842.78147 6538.51 <.0001 Error 31285 4032.482400.12890 Corrected Total 31290 8246.38974 Parameter Variable EstimateStandard Error Type II SS F Value Pr > F Intercept 0.15801 0.00257489.12701 3794.77 <.0001 P033_F −0.59310 0.01084 386.19290 2996.18<.0001 P046_I −0.32298 0.00654 314.43281 2439.45 <.0001 P082_A −0.327210.00601 381.95629 2963.31 <.0001 P084_V −0.53666 0.00721 714.536405543.55 <.0001 P090_M −0.25737 0.00511 326.53882 2533.37 <.0001

Variable P047_V Entered: R-Square=0.5287 and C(p)=15544.70 TABLE 12Analysis of Variance Sum of Source DF Squares Mean Square F Value Pr > FModel 6 4359.53390 726.58898 5848.07 <.0001 Error 31284 3886.8558410.12424 Corrected Total 31290 8246.38974 Parameter Variable EstimateStandard Error Type II SS F Value Pr > F Intercept 0.15941 0.00252497.68905 4005.73 <.0001 P033_F −0.55617 0.01069 336.13339 2705.42<.0001 P046_I −0.27816 0.00655 223.90478 1802.13 <.0001 P047_V −0.636710.01860 145.62656 1172.10 <.0001 P082_A −0.32763 0.00590 382.937223082.13 <.0001 P084_V −0.54696 0.00708 740.88446 5963.13 <.0001 P090_M−0.25573 0.00502 322.35883 2594.56 <.0001

Variable P046_L Entered: R-square=0.5409 and C(p)=14326.32 TABLE 13Analysis of Variance Sum of Source DF Squares Mean Square F Value Pr > FModel 7 4460.84208 637.26315 5266.21 <.0001 Error 31283 3785.547670.12101 Corrected Total 31290 8246.38974 Parameter Variable EstimateStandard Error Type II SS F Value Pr > F Intercept 0.16438 0.00249526.67853 4352.36 <.0001 P033_F −0.54921 0.01056 327.61200 2707.32<.0001 P046_I −0.31345 0.00658 274.55948 2268.90 <.0001 p047_L −0.277690.00960 101.30817 837.19 <.0001 P047_V −0.63940 0.01835 146.857061213.60 <.0001 P082_A −0.26575 0.00620 222.00011 1834.56 <.0001 P084_V−0.52972 0.00702 689.88420 5701.06 <.0001 P090_M −0.24267 0.00498287.89278 2379.09 <.0001

Variable P050_V Entered: R-Square=0.5534 and C(p)=13094.82 TABLE 14Analysis of Variance Sum of Source DF Squares Mean Square F Value Pr > FModel 8 4563.24026 510.40503 4844.61 <.0001 Error 31282 3683.149480.11774 Corrected Total 31290 8246.38974 Parameter Variable EstimateStandard Error Type II SS F Value Pr > F Intercept 0.16629 0.00246538.66106 4575.00 <.0001 P033_F −0.51815 0.01046 288.64295 2451.52<.0001 P046_I −0.29956 0.00651 249.44262 2118.58 <.0001 P046_L −0.279340.00947 102.51249 870.67 <.0001 P047_V −0.64754 0.01811 150.583081278.94 <.0001 P050_V −0.82797 0.02808 102.39819 869.70 <.0001 P082_A−0.25944 0.00612 211.32860 1794.87 <.0001 P084_V −0.53915 0.00693713.14511 6056.94 <.0001 P090_M −0.24415 0.00491 291.39382 2474.89<.0001

Variable P054_M Entered: R-Square=0.5643 and C(p)=12005.72 TABLE 15Analysis of Variance Sum of Source DF Squares Mean Square F Value Pr > FModel 9 4653.81644 517.09072 4502.38 <.0001 Error 31281 3592.573300.11485 Corrected Total 31290 8246.38974 Parameter Variable EstimateStandard Error Type II SS F Value Pr > F Intercept 0.16633 0.00243538.92765 4692.51 <.0001 P033_F −0.46793 0.01049 228.56387 1990.14<.0001 P046_I −0.30134 0.00643 252.40077 2197.69 <.0001 P046_L −0.285080.00935 106.71653 929.19 <.0001 P047_V −0.55613 0.01818 107.50855 936.09<.0001 P050_V −0.84757 0.02774 107.23423 933.70 <.0001 P054_M −0.571190.02034 90.57618 788.66 <.0001 P082_A −0.26050 0.00605 213.05431 1855.09<.0001 P084_V −0.53337 0.00685 697.31870 6071.64 <.0001 P090_M −0.234540.00486 267.57375 2329.80 <.0001

TABLE 16 Analysis of Variance Source DF Sum of Squares Mean Square FValue Pr > F Model 10 4733.53615 473.35362 4214.95 <.0001 Error 312803512.85359 0.11230 Corrected 31290 8246.38974 Total Parameter VariableEstimate Standard Error Type II SS F Value Pr > F Intercept 0.167560.00240 546.67520 4867.84 <.0001 P033_F −0.40056 0.01068 158.096301407.76 <.0001 P046_I −0.30750 0.00636 262.46469 2337.10 <.0001 P046_L−0.28311 0.00925 105.23686 937.08 <.0001 P047_V −0.52353 0.0180294.83277 844.43 <.0001 P050_V −0.87024 0.02744 112.94107 1005.68 <.0001P054_L −0.45611 0.01712 79.71971 709.86 <.0001 P054_M −0.61432 0.02018104.09593 926.92 <.0001 P082_A −0.26465 0.00598 219.74203 1956.68 <.0001P084_V −0.51838 0.00679 654.14626 5824.81 <.0001 P090_M −0.22560 0.00482246.35798 2193.68 <.0001

TABLE 17 Analysis of Variance Source DF Sum of Squares Mean Square FValue Pr > F Model 11 4810.94944 437.35904 3982.07 <.0001 Error 312793435.44030 0.10983 Corrected 31290 8246.38974 Total Parameter VariableEstimate Standard Error Type II SS F Value Pr > F Intercept 0.162390.00238 510.05266 4643.93 <.0001 P033_F −0.39933 0.01056 157.118891430.54 <.0001 P046_I −0.32619 0.00633 291.69184 2655.80 <.0001 P046_L−0.29089 0.00915 110.98928 1010.54 <.0001 P047_V −0.50941 0.0178289.70713 816.77 <.0001 P050_V −0.86163 0.02714 110.70009 1007.90 <.0001P054_L −0.46062 0.01693 81.29492 740.17 <.0001 P054_M −0.61674 0.01995104.91704 955.25 <.0001 P082_A −0.25645 0.00592 205.77424 1873.53 <.0001P084_V −0.51212 0.00672 637.66342 5805.80 <.0001 P088_S 0.57942 0.0218277.41329 704.83 <.0001 P090_M −0.22099 0.00477 236.06351 2149.31 <.0001

Example 3 Dealing with Censored Values

In this example, the method developed by the inventors to deal withcensored values was applied for the drug amprenavir. Firstly, a linearregression model without censored values was calculated, see the values‘APV_pFR’ for iteration 0 in the Tables 18-21 below. The phenotypicmeasured value is <−2.083062017 for Virus strain 1, for the firstiteration, the value V₀ is equal to the measured phenotype valueAPV_pFR, but the censor is considered as ‘=’. This was followed byiteration nr. 1. Once the values were obtained from iteration 1, theinventors compared the prediction value P, e.g. for Virus strain 1,−2.362076895 with the phenotypic measured value <−2.083062017. For virusstrains 1, 2 and 3 below, the measured values have a censor ‘<’. Forvirus strains 4, the measured value has a censor ‘>’. The scenariosestablished by the inventors, when the case was ‘<’-censor, and when thecase was ‘>’-censor, were applied. New linear regression models werecalculated and for the censored values in the linear regression model,either the data-points from the training set were removed, as for virusstrains 1, 2 and for virus strain 4 (iteration 2 and 3 only), or thosedata-points were used instead of the censored phenotypes measurement, asillustrated for virus strain 3 and for virus strain 4 (only iteration1). The procedure was reiterated until the prediction converged.Virus strain 1; CASE ‘<’and P<V−0.798 σV−0.798 σ=−2.282562017

TABLE 18 ITER APV_censor APV_pFR Prediction P value 0 < −2.083062017−2.083062017 1 < −2.083062017 −2.362076895 2 < −2.083062017 −2.5373581883 < −2.083062017 −2.550836421Virus strain 2; CASE ‘<’and P<V−0.798 σV−0.798 σ=−2.108607374

TABLE 19 ITER APV_censor APV_pFR Prediction P value 0 < −1.909107374−1.909107374 1 < −1.909107374 −2.608207782 2 < −1.909107374 −2.7439365053 < −1.909107374 −2.748739259Virus strain 3; CASE ‘<’and P<V−0.798 σV−0.798 σ=−2.108607374

TABLE 20 ITER APV_censor APV_pFR Prediction P value 0 < −1.956716334−1.956716334 1 < −1.956716334 −1.343253355 −2.016673816 2 < −1.956716334−1.401494738 −2.021216328 3 < −1.956716334 −1.405764249 −2.02157333Virus strain 4; CASE ‘<’and P<V=0.798 σV=0.798 σ=−0.71428642

TABLE 21 ITER APV_censor APV_pFR Prediction P value 0 > 0.5154286420.515428642 1 > 0.515428642 0.671585565 0.714928642 2 > 0.5154286420.759236384 3 > 0.515428642 0.770724812

Example 4 Mutations Trajectories

A methodology for removing unwanted correlations involved an algorithmdeveloped by the inventors in which the change in pFR was tracked as theeffects of individual mutations or combinations of mutations wereremoved from the dataset. The effect of each mutation or combination ofmutations was separated out. The methodology followed mutationtrajectories towards the global average as the effects of individualmutations or combinations of mutations were removed.

In a first step, the average pFR was calculated for all mutations with asufficient count in the database to be significant, i.e. >20. In theTable below the first and last 10 mutations of a list of 368 mutationsis shown with the corresponding calculated average pFR. TABLE 22 LabelCount Average pFR 54M 239 −1.35421631044816 50V 129 −1.2886867154532947V 281 −1.27082426054035 84A 21 −1.23536347179756 76V 187−1.20328089469875 89V 269 −1.19273200093637 91S 36 −1.15103609689224 33M22 −1.12268052165089 84C 21 −1.11529918764429 54L 327 −1.09959239280592. . . 18L 65 0.12143300387582 69K 1387 0.121561573218861 89M 12610.134344491996282 33V 406 0.140725379487564 17E 86 0.154734960149477 63I20 0.159585638802379 62M 22 0.184594303565984 19S 21 0.20279897375220315L 51 0.213932584381636 88S 204 0.494820688128991

The extremes were determined, and the mutation with the pFR furthestaway from the global average was selected. In Table 22, the mutationselected was 54M.

Following, all virus strains that had the selected mutation wereremoved, in total it amounted to 283 virus strains: from 26738 to 26455.A new

Table 23 of average pFR was generated for the remaining mutations. Thelist below shows the first and last 10 mutations of the obtainedresults. TABLE 23 New table of average per mutation: Label Count AveragepFR 50V 129 −1.28868671545329 84A 21 −1.23536347179756 76V 168−1.13676536523909 47V 215 −1.12849822159876 91S 35 −1.12648669210251 33M22 −1.12268052165089 84C 21 −1.11529918764429 54L 327 −1.0995923928059289V 213 −1.05990043312845 33F 852 −0.988257531387902 . . . 18L 650.12143300387582 69K 1384 0.125380073379929 89M 1256 0.14113693899152533V 404 0.147380396015872 17E 86 0.154734960149477 63I 200.159585638802379 62M 22 0.184594303565984 19S 21 0.202798973752203 15L51 0.213932584381636 88S 202 0.494762597743788

The previous steps were reiterated. In Table 24 there is listed themutations which were selected at different iteration counts. TABLE 24Count Mutation selected 26738 54M 26455 50V 26296 84A 26272 91S 2622047V 25947 76V 25762 54L 25365 22V 25260 33F 24545 32I 24183 84V 2178454S 21728 82F 21577 54T 21427 24F 21397 24I 20894 73T 20702 73C 2060455R 20324 95F 20194 10R 20122 54A 20062 82C 20039 88S 19818 46L 1924723I 19190 58E 18909 67F 18877 73S 18105 53L 17920 10F 17568 82A 1663746I 16215 54V 16028 35G 15996 90M

The procedure was stopped when the last selected mutation had an averagepFR that approximated to the global average. In FIG. 17, a histogram ofthe population left after removing all virus strains during theiterations is shown. In FIG. 18, the trajectory of mutation 18H, whichhas no significant phenotype, demonstrates the underlying cause that avirus strain is resistant due to other mutations than 18H (54M, 76V,33F, 84V).

Example 5 Highly Correlated Mutations

A methodology for removing unwanted correlations proceeded as follows.In a first stage, the correlation coefficient between all mutations(with a sufficient count in the database, i.e. >20) and the pFR wascalculated. In Table 25 below there is listed the first and last 10coefficients of a list of 202 coefficients that were calculated. TABLE25 NAME correlation with APV_pFR P084_V −0.51430323209603 P090_M−0.510166810424169 P010_I −0.43843660921923 P046_I −0.430221405121849P071_V −0.400337497273138 P033_F −0.385947043862637 P082_A−0.343946685713202 P054_V −0.339269018202483 P032_I −0.275105400287351P010_F −0.263412991920486 . . . P065_D 0.044821230615045 P012_A0.0476139542192145 P014_R 0.0485859961407927 P033_V 0.0571605219086003P041_K 0.0713497764580897 P063_S 0.071416853283617 P030_N0.080199177466301 P089_M 0.0904859904655573 P069_K 0.0926109439533826P088_S 0.101920449277774

Consequently, the extremes were determined (maximum, minimum), and themutation with the highest (absolute value of) correlation coefficientwas selected. In this case was P084_V.

In the following step, a linear model for the pFR with the selectedmutation(s) (from step 2, all previous iterations) was calculated. Thepredicted model obtained was Predicted pFR=−0.844 * M₈₄

Following, the residue was taken (pFR minus the predicted value from themodel). In FIG. 19 a graph of the residues as a function of the measuredvalues is shown. In FIG. 20, the same graph is represented in the formof histograms where the distribution of the residue may be observed.

Then, the correlation coefficient between all mutations with asufficient count in the database and the residue was calculated. Resultsof the first and last 10 variable are shown here below. It will beobserved that the order of the mutations had changed because theinfluence of mutations P084_V had been removed. TABLE 26 NAMECorrelation with Residue P082_A −0.402088216120049 P090_M−0.373018037350597 P033_F −0.347337204147266 P032_I −0.325689507827784P046_I −0.324078296352946 P010_I −0.321343698855709 P054_V−0.311246000758214 P071_V −0.290388889036061 P047_V −0.272571613897309P046_L −0.234148912637279 . . . P012_S 0.0358242687063381 P014_R0.0401577992704191 P012_A 0.0442436406740132 P063_S 0.0484345347109348P033_V 0.0537012592855092 P030_N 0.0548222080117112 P041_K0.0659977808671604 P089_M 0.078693680117922 P069_K 0.0835173137603877P088_S 0.110781153455097

Consequently, the extremes were determined again, and the mutation withthe highest absolute value of the correlation coefficient was selected.The mutation selected now was P082_A.

A new linear model for the pFR with the selected mutation P082_A, wascalculated.pFR=−0.782*M84+−0.435*M82

After 6 iterations, the following resistance coefficients for a selectedgroup of mutations was obtained. TABLE 27 Parameter β P084_V−0.512345233450494 P082_A −0.247464800298171 P090_M −0.156011849225004P033_F −0.532092050020052 P046_I −0.205719880528639 P047_V−0.656374746817602

In FIG. 21, a graph is represented in the form of histograms showing theresidues after the 6 parameters were taken into account.

The reiterative procedure continued until the last selected mutation hada correlation coefficient that approximated to zero.

Example 6 Results from Linear Regression Modeling

In an initial test, genotypes and corresponding phenotypes determinedfor ritonavir (RTV) for 28,540 HIV-1 clinical isolates were used. Thelinear regression analysis identified 20/22 RTV resistance-associatedmutations described in the IAS mutation list (all except 10F and 77I)(see FIG. 13). Additional mutations whose effect on RTV susceptibilityhad been previously described (e.g. 73S/T/C, 84A/C and 88D) were alsoidentified (FIG. 14). Overall, 53 single mutations and 96 pairs ofmutations were identified as having significant effect on susceptibilityto RTV.

The predicted phenotype was compared to the measured phenotype in aleave-one-out cross-validation, demonstrating a root mean square errorof 0.31 (logFR) (see FIG. 15. The error rate of the linear modelingmethod [5.62% (sensitivity=93.0%, specificity=95.4%)], comparedfavourably to a decision tree-based model [Beerenwinkel, PNAS 99, (2002)8271-8276] [10.2% (sensitivity=89.8%, specificity=89.7%)] (see FIG. 16).

The robustness of the algorithm as a function of the size of the inputdataset was assessed using smaller subsets of data. Nine of 22 IASresistance-associated mutations for RTV could be identified with subsets≧5% (1600 isolates) of the original data. However, the accuracy of thepredicted contribution of the mutations improved with increasing datasetsizes up to 50% of the original database (median standard error of thepredicted contributions decreased 50%). Some secondary mutations (e.g.10R, 32I, 82S) were identified as having a significant contribution toresistance only when the subset size reached a similar 50% level.

Example 7 Comparison of Genotype-to-phenotype Prediction for DifferentArtificial Intelligence Techniques

Analyses were performed on matching genotype/phenotype datasets for all16 HIV inhibitors currently available. The matching genotype/phenotypedatasets consisted of approximately 30,000 data points for most drugsexcept Lopinavir and Tenofovir.

As an example, the following results for Ritonavir were obtained:

-   -   A log(FR) root mean squared error of 0.31 (MSE=0.096).    -   A classification error of 5.6% (with regard to a standard cutoff        applied by the inventors of FR=3.5).    -    (sensitivity=93.0%, specificity=95.4%).

Regression model identified 53 single mutations and 96 pairs ofmutations. 20 out of 22 mutations of the IAS list are confirmed in thismodel. TABLE 28 Model CV-MSE* Classification inspection r² (in log(FR))error Nr of samples Coverage possible Neural network 0.88 n/a n/a 1,322100% N (Lopinavir)^(1,2) Support Vector 0.79 0.176 n/a 652 100% NMachine (Ritonavir)⁴ Support Vector 0.81 0.144 n/a 17,453 100% NMachines (Tibotec data, Ritonavir) Decision tree Ctassification 10.2%469 100% Y Ritonavir)³ only Clustering n/a n/a 15.3% 1,152 100% N(Indinavir)⁵ Self Organizing Classification   15% 811  84% N Map(Saquinavir)² only (38 matching) Linear regression 0.88 0.096  5.6%34,502 100% Y modeling (Virco data, Ritonavir)*CV-MSE: cross-validation mean squared error. Depending on the analysis,it is a leave-one-out or a 10-fold cross-validation¹A 28-Mutation Neural Network Model that Accurately Predicts PhenotypicResistance to Lopinavir (LPV) D Wang, R Harrigan and BA Larder,Antiviral Therapy 2001 (Supplement 1): 105²Predicting HIV Drug Resistance With Neural Networks S Dr{hacek over(a)}ghici, R Potter, Bioinformatics, Vol. 19 no. 1, 2003 (p. 98-107).³Geno2pheno: Interpreting Genotypic HIV Drug Resistance Tests NBeerenwinkel et al., Intelligent Systems in Biology, November/December2001.⁴Geno2pheno: estimating phenotypic drug resistance from HIV-1 genotypesN Beerenwinkel et al., Nucleic Acids Research, 2003, Vol. 31, No 13.⁵Predicting phenotype from genotype: a comparison of statistical methodsA Foulkes et al.

Neural networks, support vector machine and clustering techniques are nosuitable candidates for a quantitative prediction system with therequirement that such a system should have a descriptive power in orderto be inspectable by experts and to be able to motivate certainpredictions to customers. Decision trees allow experts to have insightin how the decision process works. But decision trees require massiveamounts of data to build complex trees, since the information in thedata is not optimally used.

Based on the performed analyses and the linear regression modellingfeasibility study, linear regression modelling seems to meet therequirements for a quantitative prediction system. Linear regressionseems to outperform other examined techniques and offers the possibilityof inspection to HIV-experts. Considering the lower error rates, thelinear regression modelling allows the optimisation of drug therapy inpatients.

Example 8 Comparison of the Regression Linear Model with a Rules-basedAlgorithm

The regression linear model developed by the inventors was used to testthe drug-associated resistance of one HIV-1 sample on nevirapine,delavirdine, and efavirenz. In parallel, the same sample was run in arules-based algorithm methodology as described in WO01/79540.

The results were expressed in FR, or fold change in IC50 or EC50,relative to reference wild-type virus, which is the drug concentrationat which 50% of the enzyme activity is inhibited, and is expressed in μMunits. The FR cutoffs values for normal susceptible ranges were 8, 10,and 6, for nevirapine, delavirdine, and efavirenz, respectively.

The phenotypic antiviral experiment was taken as the gold-standard bythe inventors. Said phenotypic antiviral experiment was performed asdescribed in WO97/27480. Results of the three methodologies are enclosedin Table 3 below.

According to the results obtained by the rules-based algorithm, thepatient's sample would be susceptible to all three drugs. However, whenthe results obtained by the linear regression model were considered, thesample showed resistance against nevirapine and efavirenz andsusceptibility against delavirdine. These last results were confirmed bythe phenotypic antiviral experiments. TABLE 29 Linear PhenotypicRules-based algorithm regression antiviral Matches in model experimentdrug database FR FR FR Nevirapine 321 3.6 176.1 >89 delavirdine 301 1.93.5 3.7 efavirenz 300 1.9 12.9 17.5

1. A method for quantitating the individual contribution of a mutationor combination of mutations to the drug resistance phenotype exhibitedby HIV, said method comprising the step of performing a linearregression analysis using data from a dataset of matching genotypes andphenotypes, wherein the log fold resistance, pFR, of each HIV strain ismodelled as the sum of all the individual resistance contributions foreach of the mutations or combinations of mutations that occur in HIVaccording to the following equation;pFR=β_(A)M_(A)+β_(B)M_(B)+β_(n)M_(n)+ . . . +β_(Z)M_(Z)+ε  wherein eachindividual resistance contribution is calculated by multiplying amutation factor, M_(A), M_(B), . . . , M_(Z), for each mutation orcombination of mutations by a resistance coefficient β_(A), β_(B), . . ., β_(Z); wherein for a combination of mutations, the mutation factorM_(n) represents the co-occurrence of one mutation with other one ormore mutations and the coefficient β_(n) represents the synergy orantagonism between the one mutation with the other one or moremutations; wherein the mutation factor assigned to each mutation orcombination of mutations reflects the degree to which that mutation orcombination of mutations is present in the HIV strain and, if present,to which degree the mutation is present in a mixture; wherein eachresistance coefficient reflects the contribution of the mutation orcombination of mutations to the fold resistance exhibited by the strain;and wherein the error term ε, represents the difference between themodelled prediction and the experimentally determined measurement.
 2. Amethod according to claim 1, wherein correlations are removed from thedataset for correlated mutations where not all correlated mutationscontribute to the drug resistance phenotype, using an algorithm to trackthe change in pFR for each mutation as the effects of individualmutations or combinations of mutations are removed from the dataset. 3.A method according to claim 2, wherein the algorithm performs thefollowing steps: a) calculate average pFR for all mutations with asufficient count in the database to be significant; b) determine theextremes (maximum, minimum), and select the mutation with the pFRfurthest away from the global average; c) remove all virus strains thathave the selected mutation from the dataset and reiterate from step a);d) stop when the selected mutation in step b) has an average pFR thatapproximates to the global average; such that removing virus strainswith a certain resistance causing mutation results in an increase of theaverage pFR for correlating mutations, which thus have a higher averagepFR.
 4. A method according to claim 1, wherein the algorithm performsthe following steps: a) calculate correlation coefficient between allmutations with a sufficient count in the database and the pFR; b)determine the extremes (maximum, minimum), and select the mutation withthe highest absolute value of correlation coefficient; c) calculate alinear model for the pFR with the selected mutation(s) (from step b),all previous iterations); d) take the residue; e) calculate correlationcoefficient between all mutations with a sufficient count in thedatabase and the residue; f) determine the extremes (maximum, minimum),and select the mutation with the highest absolute value of correlationcoefficient; g) calculate a linear model for the pFR with the selectedmutation(s) (from step f), all previous iterations); and h) reiteratefrom step d); i) stop when the selected mutation in step g) has acorrelation coefficient that approximates to zero.
 5. A method accordingto claim 1, wherein censored values in the genotype/phenotype databaseare replaced by a maximum likelihood estimation.
 6. A method accordingto claim 5, wherein for each iteration of the linear regression, thefollowing steps are performed until the predictions converge: a)Calculate a linear regression model without censored values; b) Use thephenotypic measured value V₀ as if the censor was “=”, e.g. when aresult is expressed as −log FR<4, we will treat V₀ as −log FR=4; c) Lookat the prediction P from the model and apply either: When case‘<’-censor: P<V₀−0.798σ (center of gravity of half Gaussiandistribution) Remove value from training data for next iterationV₀−0.798σ ≦P<V₀ Use V′=V₀−0.798σ for next iteration V₀≦P Use V′ centreof gravity of tail (<V) of a normal distribution N (P, σ) as value fornext iteration. When case ‘>’-censor: P>V₀+0.798σ (center of gravity ofhalf Gaussian distribution) Remove value from training data for nextiteration V₀+0.798σ ≧P>V₀ Use V′=V₀−0.798σ for next iteration V₀≧P UseV′ centre of gravity of tail (>V) of a normal distribution N (P, σ) asvalue for next iteration. d) Calculate a linear regression model and forthe censored values in the linear regression model, either remove thedata-point from the training set, or use V′ instead of the censoredphenotypes measurement, as-described in step c); e) Re-iterate from stepb) until prediction converges.
 7. A method of identifying a mutationthat effects the degree of drug resistance exhibited by an HIV strainusing a method according to claim
 1. 8. A method according to claim 1wherein the contribution of a mutation pattern to the drug resistancephenotype exhibited by an HIV strain is calculated, said methodcomprising the steps of: a) obtaining a genetic sequence of said HIVstrain, b) identifying the pattern of mutations in said geneticsequence, wherein said mutations are associated with resistance orsusceptibility to drug therapy, and c) calculating the fold resistanceof the HIV strain as compared to the wild type HIV strain by performinga linear regression analysis according to claim
 1. 9. A method accordingto claim 8, which incorporates a method according to claim
 1. 10. Amethod according to claim 1, wherein in the case of small datasets forparticular mutations or combinations of mutations, the method is appliedrecursively to the set of virus strains that exhibit those particularmutations or combinations of mutations.
 11. A diagnostic method foroptimising a drug therapy in a patient, comprising performing a methodaccording to claim 7 for each drug or combination of drugs beingconsidered to obtaining a series of drug resistance phenotypes andtherefore assess the effect of the plurality of drugs or drugcombinations on the predicted fold resistance exhibited by the HIVstrain with which the patient is infected and selecting the drug or drugcombination for which the HIV strain is predicted to have the lowestfold resistance.
 12. (canceled)
 13. Use of a method according to claim 1for assessing the efficiency of a patient's therapy or for evaluating oroptimizing a therapy.
 14. A diagnostic system for quantitating theindividual contribution of a mutation or combination of mutations to thedrug resistance phenotype exhibited by an HIV strain, said systemcomprising: a) means for obtaining a genetic sequence of said HIVstrain; b) means for identifying the mutation pattern in said geneticsequence as compared to wild type HIV; c) means for predicting the foldresistance exhibited by the HIV strain using the method of claim
 1. 15.A computer apparatus or computer-based system adapted to perform themethod of claim
 1. 16. A computer program product for use in conjunctionwith a computer, said computer program comprising a computer readablestorage medium and a computer program mechanism embedded therein, thecomputer program mechanism comprising a module that is configured sothat upon receiving a request to quantify the individual contribution ofa mutation or combination of mutations to the drug resistance phenotypeexhibited by HIV, or to calculate the quantitative contribution of amutation pattern to the drug resistance phenotype exhibited by an HIVstrain, it performs a method according to claim 1.